exchange at the University of California, Berkeley (in the San Francisco Bay Area)
you should consider exchange too!
Doctorate at Oxford 2013-2017
detecting planets with Kepler
Postdoc at New York University 2017-2020
in data science and physics departments
more Kepler stuff, imaging planets, radio stars
Lecturer at the University of Queensland 2021-2024
Just started here - be patient!
Don’t hesitate to ask anything - about stats, careers, or uni
A/Prof Benjamin Pope (me)
Why is an astronomer a statistician?
In my research, I want to
learn how stars work
detect planets around them
develop technology for doing this better
James Webb Space Telescope
All of these problems are data analysis problems.
What do we expect?
you may or may not have a stats background - this is fine!
stats is data science and data science is stats - you want to learn this material well
you may not have much experience with Python or Git - put work into getting good at these!
come to lectures & meet your peers - this is really valuable!
ask questions on the online discussion forum first, and feel free to email too - but it’s great if we can answer a question lots of people might have
What do we mean by statistics?
What is data science?
$$$
it’s stats
What do you mean by statistics?
the Census?
spreadsheets?
averages and standard deviations?
probability theory?
All of these things!
What do I mean by statistics?
Statistics is the science of reasoning about uncertainty.
Data are always noisy and incomplete, and the art is in properly accounting for this to get reliable, accurate, and precise results.
We’ll study how to
gather data;
visualize data;
summarize data;
fit models to data;
interpret the models;
and make decisions.
Today
Let’s start with how to gather, visualize and summarize data.
Public Datasets
There are a lot of data available in public sources. One is the Australian Bureau of Statistics; we may use some of its datasets later.
But another - which might help you get a job! - is Kaggle, which hosts data science competitions. It provides public datasets, and you can compete to produce the best models to explain and predict them.
So we see that a data frame has columns, each of which corresponds to some property of the data points, like price, suburb, etc. Every individual house sold is a row in this table.
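A minimal sketch of this structure in pandas (the column names mirror the housing dataset, but the values here are made up):

```python
import pandas as pd  # the standard Python package for tabular data

# a tiny made-up data frame with the same kind of columns as the housing data
df = pd.DataFrame({
    'price': [1_200_000, 850_000, 2_400_000],
    'suburb': ['Newtown', 'Parramatta', 'Mosman'],
    'date_sold': ['2021-03-01', '2020-11-15', '2021-07-30'],
})

print(df.columns.tolist())  # each column is a property of the data points
print(len(df))              # each row is one individual house sold
```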
Selecting Data
Let’s look at only those data from 2021:
df['year'] = pd.to_datetime(df.date_sold).dt.year # dates in the csv are strings, so parse them and extract the year
df21 = df[df.year==2021]
print(df21)
Now let’s make the most basic visualization of a dataset - a histogram.
You should almost always do this!
We are going to use another package you are going to learn inside and out: matplotlib.
import matplotlib.pyplot as plt # this makes plots in python
plt.hist(df21.price/1e6, bins=100); # semicolon to suppress output; /1e6 to make readable
plt.xlabel('House Price ($M)') # always label your axes!
plt.ylabel('Number of Houses');
Filtering Data
There are some expensive houses in Sydney! Let’s look at the lower end:
realistic = df21[df21.price < 5e6]
plt.hist(realistic.price/1e6, bins=100); # semicolon to suppress output; /1e6 to make readable
plt.xlabel('House Price ($M)') # always label your axes!
plt.ylabel('Number of Houses');
Summary Statistics
Let’s talk about the mean, the median and percentiles, and the mode as ways of talking about a distribution.
The mean is defined as \[
\langle{x}\rangle \equiv \frac{1}{N} \sum_{i=1}^{N} x_i
\]
i.e. the total divided by the number of items - the per-item contribution to the total.
The median is the value of \(x\) such that 50% of samples are higher, and 50% are lower: i.e. the middle of the distribution. More generally, a percentile is defined so that (say) 90% of samples are less than the 90th percentile.
The mode is the most common value.
NumPy
For doing maths like this on data, we want to use numpy, the standard Python package for numerical calculations:
import numpy as np # always as np!
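For example, on a small made-up sample of prices:

```python
import numpy as np

prices = np.array([0.5, 0.7, 0.8, 0.8, 1.1, 3.0])  # toy prices in $M

print(np.mean(prices))            # total divided by number of items: ~1.15
print(np.median(prices))          # half the samples lie below this: 0.8
print(np.percentile(prices, 90))  # 90% of samples lie below the 90th percentile
```

Note that NumPy has no `mode` function for continuous data - for a continuous quantity like price, the mode is usually estimated as the peak of a histogram, which is what we do below.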
Summary Statistics
Let’s plot the same histogram as above, but showing the summary statistics.
h = plt.hist(realistic.price/1e6, bins=100); # semicolon to suppress output; /1e6 to make readable
plt.xlabel('House Price ($M)') # always label your axes!
plt.ylabel('Number of Houses')
plt.axvline(np.mean(realistic.price/1e6), ls='-', lw=5, color='C0', label='Mean')
plt.axvline(np.median(realistic.price/1e6), ls=':', lw=5, color='C1', label='Median')
for percentile in [10, 90]: # this is a for loop for doing multiple things
    plt.axvline(np.percentile(realistic.price/1e6, percentile), ls=':', color='C1',
                lw=3, label=f'{percentile}th Percentile') # this is an f-string for printing things
mode = np.argmax(h[0]) # index of the tallest histogram bar
mode_price = h[1][mode] # left edge of that bin
plt.axvline(mode_price, ls='--', color='C2', lw=5, label='Mode')
plt.legend()
Relationships
The core thing we want to do in data science is to make inferences from data. This means finding relationships in data to help us predict or understand what is happening.
Side note: infer vs imply. What’s the difference?
Data imply things to us.
We infer things from data.
Trend Lines
Let’s see if we can plot a trend line for prices over time:
years = np.arange(2016, 2022, 1) # integers 2016 up to and including 2021 - arange excludes the stop value
means, lowers, uppers = [], [], [] # initialise empty lists
for year in years:
    thisdf = df[df.year == year]
    means.append(np.mean(thisdf.price))
    lowers.append(np.percentile(thisdf.price, 25))
    uppers.append(np.percentile(thisdf.price, 75))
# make these into arrays
means, lowers, uppers = np.array(means), np.array(lowers), np.array(uppers)
Plot this
Now we can see a (depressing) trend with time:
plt.plot(years, means/1e6, 'C2--', label='Mean Price')
plt.fill_between(years, lowers/1e6, uppers/1e6, color='C2', alpha=0.2, label='25-75 percentile')
plt.ylabel('Price ($M)')
plt.xlabel('Year')
plt.xticks(years)
plt.title('Sydney House Prices in 2016-2021')
plt.xlim(years.min(), years.max())
plt.legend(loc='upper left');
Scatter Plots
What if we want to see how multiple things relate, not just time?
We can use a scatter plot, in which each individual data point is rendered as a dot on whatever axes we like. Let’s see how property size relates to price:
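As a sketch with made-up data (the actual slide uses the housing data frame's `property_size` and `price` columns; here we generate random stand-ins so the example is self-contained):

```python
import numpy as np
import matplotlib.pyplot as plt

# made-up property sizes and prices, loosely resembling the housing data
rng = np.random.default_rng(0)
size = rng.uniform(100, 1000, 200)                   # property sizes (sqm)
price = 0.5 + 0.002*size + rng.normal(0, 0.3, 200)   # prices ($M)

plt.scatter(size, price, s=4, alpha=0.8)  # each data point is one dot
plt.xlabel('Property Size (sqm)')
plt.ylabel('Price ($M)')
```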
We can do the same comparison for apartments and terraces and overlay them:
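Overlaying just means calling `plt.scatter` once per subset on the same axes and adding a legend. A sketch, again with made-up data (the `property_type` column name is hypothetical - check the real dataset for its actual name):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# made-up stand-in for the 2021 housing data frame
rng = np.random.default_rng(1)
df21 = pd.DataFrame({
    'property_size': rng.uniform(50, 400, 100),
    'price': rng.uniform(0.4e6, 2e6, 100),
    'property_type': rng.choice(['Apartment', 'Terrace'], 100),  # hypothetical column
})

# one scatter call per property type, overlaid on the same axes
for kind in ['Apartment', 'Terrace']:
    sub = df21[df21.property_type == kind]
    plt.scatter(sub.property_size, sub.price/1e6, s=4, alpha=0.8, label=kind)
plt.xlabel('Property Size (sqm)')
plt.ylabel('Price ($M)')
plt.legend()
```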
Coloured Scatter Plots
We don’t have to be restricted to just representing 2 dimensions: we can even put a colour map on the data to represent a third quantity!
houses.sort_values('km_from_cbd', inplace=True, ascending=True)
plt.scatter(houses['property_size'], houses['price']/1e6, alpha=0.8, s=4,
            c=houses['km_from_cbd'], label='House', cmap='inferno')
plt.colorbar(label='Distance from CBD (km)')
plt.xlim(0, 2000)
plt.xlabel('Property Size (sqm)')
plt.ylabel('Price ($M)')
plt.title('Freestanding House Prices');
Fitting a Line to Data
The most important thing in your whole job will be learning from data: finding a mathematical representation that explains current data and predicts new data.
This can be as simple as fitting a line.
In the next lectures, we’re going to learn how to do this: but let’s see what we mean by that!
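As a preview, here is a minimal sketch with made-up data: `np.polyfit` finds the least-squares straight line through noisy points, recovering roughly the slope and intercept we put in.

```python
import numpy as np

# made-up data scattered around a known straight line y = 2x + 1
rng = np.random.default_rng(2)
x = np.linspace(0, 10, 50)
y = 2.0*x + 1.0 + rng.normal(0, 0.5, 50)

# least-squares fit of a degree-1 polynomial (a straight line)
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)  # close to the true values 2.0 and 1.0
```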